-
Notifications
You must be signed in to change notification settings - Fork 170
feat: Add support for nw.int_range for eager backends
#2895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Hey @FBruzzesi, as I've mentioned 8347839 times, I'm very keen to see this in Would you be okay if we let this one marinate for a bit and maybe target the release after the next? π |
π
Yeah I was not expecting this to be finalized by tomorrow. It has some sharp edges that need some love and work |
|
@MarcoGorelli any "trick" to avoid the custom check in precommit?
I thought about emulating that |
You could add this to IntegerDType: TypeAlias = "dtypes.IntegerType | type[dtypes.IntegerType]" |
#2982 removed from everywhere else
> TypeError: argument 'step': 'PolarsExpr' object cannot be interpreted as an integer https://github.com/narwhals-dev/narwhals/actions/runs/17499840563/job/49709898155?pr=2895
@MarcoGorelli pinging again as (#2895 (comment)) was a month ago π«£ |
|
@FBruzzesi I still have hope that eventually we'll land this π I have an idea of how we could reuse IdeaAs an example, if we convert the output of import polars as pl
import pyarrow as pa
start, end = dt.date(2000, 1, 1), dt.date(2000, 1, 5)
expr = pl.date_range(start, end).alias("date")
table = pl.select(expr).to_arrow()
table.column("date").typeSee Since that is just represented as a 32-bit integer, we can do things like: dates = table.column("date")
datesShow Output
dates.cast(pa.int32())Show Output
dates.cast(pa.int32()).cast(pa.date32())Show Output
In practiceimport datetime as dt
import pyarrow as pa
def date_range(
start: dt.date,
end: dt.date,
interval: int, # (* assuming the `Interval` part is solved)
*,
closed: ClosedInterval = "both",
) -> pa.Date32Array:
start_i = pa.scalar(start).cast(pa.int32()).as_py()
end_i = pa.scalar(end).cast(pa.int32()).as_py()
# call `int_range` here for the compatibility branch
arr = pa.arange(start_i, end_i + 1, interval)
if closed != "both":
if closed == "left":
arr = arr.slice(length=len(arr) - 1)
elif closed == "none":
arr = arr.slice(1, len(arr) - 1)
else:
arr = arr.slice(1)
# the first cast would happen in `int_range(dtype=...)`
return arr.cast(pa.int32()).cast(pa.date32())
start, end = dt.date(2000, 1, 1), dt.date(2000, 2, 1)
date_range(start, end, interval=7, closed="none")
<pyarrow.lib.Date32Array object at 0x00000296A74176A0>
[
2000-01-08,
2000-01-15,
2000-01-22,
2000-01-29
]What's the catch?We'd need to adapt Show
|
| class Interval: | |
| def __init__(self, multiple: int, unit: IntervalUnit, /) -> None: | |
| self.multiple: int = multiple | |
| self.unit: IntervalUnit = unit | |
| def to_timedelta( | |
| self, *, unsupported: Container[IntervalUnit] = frozenset(("ns", "mo", "q", "y")) | |
| ) -> dt.timedelta: | |
| if self.unit in unsupported: # pragma: no cover | |
| msg = f"Creating timedelta with {self.unit} unit is not supported." | |
| raise NotImplementedError(msg) | |
| kwd = UNIT_TO_TIMEDELTA[self.unit] | |
| # error: Keywords must be strings (bad mypy) | |
| return dt.timedelta(**{kwd: self.multiple}) # type: ignore[misc] | |
| @classmethod | |
| def parse(cls, every: str) -> Interval: | |
| multiple, unit = cls._parse(every) | |
| if unit == "mo" and multiple not in MONTH_MULTIPLES: | |
| msg = f"Only the following multiples are supported for 'mo' unit: {MONTH_MULTIPLES}.\nGot: {multiple}." | |
| raise ValueError(msg) | |
| if unit == "q" and multiple not in QUARTER_MULTIPLES: | |
| msg = f"Only the following multiples are supported for 'q' unit: {QUARTER_MULTIPLES}.\nGot: {multiple}." | |
| raise ValueError(msg) | |
| if unit == "y" and multiple != 1: | |
| msg = ( | |
| f"Only multiple 1 is currently supported for 'y' unit.\nGot: {multiple}." | |
| ) | |
| raise ValueError(msg) | |
| return cls(multiple, unit) | |
| @classmethod | |
| def parse_no_constraints(cls, every: str) -> Interval: | |
| return cls(*cls._parse(every)) | |
| @staticmethod | |
| def _parse(every: str) -> tuple[int, IntervalUnit]: | |
| if match := PATTERN_INTERVAL.match(every): | |
| multiple = int(match["multiple"]) | |
| unit = cast("IntervalUnit", match["unit"]) | |
| return multiple, unit | |
| msg = ( | |
| f"Invalid `every` string: {every}. Expected string of kind <number><unit>, " | |
| f"where 'unit' is one of: {get_args(IntervalUnit)}." | |
| ) | |
| raise ValueError(msg) |
It would be restricted to only d, w, mo, q, y - but even within that - maybe more flexible given that we know +1d is equivalent to +1 π€?
| @unstable | ||
| def int_range( | ||
| start: int | Expr, | ||
| end: int | Expr | None = None, | ||
| step: int = 1, | ||
| *, | ||
| dtype: IntegerDType = Int64, | ||
| eager: IntoBackend[EagerAllowed] | Literal[False] = False, | ||
| ) -> Expr | Series[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note
I don't think this blocks anything - just a realization from me π
I hadn't been able to put my finger on what seemed off to me wrt this until just now:
Polars can do
eager: bool. In our case it's a bit more complex than that.
Instead of adding yet another argument to the function, I allowed foreagerto be False (default), None or the backend/implementation that should back the series.
This trick would work for us in all the cases where eager defines Expr | Series.
That is most of them, but there is polars.select as the ugly duckling with:
DataFrame | LazyFrame
@overload
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
eager: Literal[True] = ...,
**named_exprs: IntoExpr,
) -> DataFrame: ...
@overload
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
eager: Literal[False],
**named_exprs: IntoExpr,
) -> LazyFrame: ...
def select(
*exprs: IntoExpr | Iterable[IntoExpr], eager: bool = True, **named_exprs: IntoExpr
) -> DataFrame | LazyFrame:If we added select, then we'd need two arguments to be able to say whether we want pl.DataFrame or pl.LazyFrame
def select(
*exprs: IntoExpr | Iterable[IntoExpr],
backend: IntoBackend[Backend],
eager: bool = True,
**named_exprs: IntoExpr,
) -> DataFrame | LazyFrame: ...But that also seems a bit footgun-y, since something like select(..., backend="duckdb") would have a conflicting default.
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below
Related issue: #2722
I am slightly concerned that this could turn out to be a month long PR. I will open it as draft for now. There are a couple of pain points I already know:
I was very careful with type hints, yet it somehow doesn't pass type checkerFixedeager: bool. In our case it's a bit more complex than that. Instead of adding yet another argument to the function, I allowed foreagerto be False (default), None or the backend/implementation that should back the series.